[Python for data analysis] - Diabetes Dataset

Introduction

This dataset represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes. Information was extracted from the database for encounters that satisfied the following criteria.

The data contains such attributes as patient number, race, gender, age, admission type, time in hospital, medical specialty of the admitting physician, number of lab tests performed, HbA1c test result, diagnosis, number of medications, diabetic medications, number of outpatient, inpatient, and emergency visits in the year before the hospitalization, etc.

The dataset can be found at https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008


Table Of Contents


Data Exploration

Importation of all the libraries that we will use

Random State

We define a seed to have reproducible results.
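A minimal sketch of how a global numpy seed gives reproducible draws (the value 42 is purely illustrative):

```python
import numpy as np

SEED = 42  # illustrative value; any fixed integer gives reproducible runs
np.random.seed(SEED)

# Re-seeding before each draw produces identical sequences
np.random.seed(SEED)
a = np.random.rand(3)
np.random.seed(SEED)
b = np.random.rand(3)
```

Every library that relies on numpy's global random state (scikit-learn included) will then behave the same from run to run.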

Display settings

We set the bokeh output to notebook

We define a permanent style for the plots in order to have a homogeneous visualization

All bokeh and plotly graphs are dynamic, so you can hover the mouse over an element to see which data it refers to.
You can also interact with the legend of the plot to show/hide some categories

We set the maximum number of columns allowed to be displayed to 100 so that the whole dataframe can be printed
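Assuming the pandas display option is the one being set, a sketch might look like:

```python
import pandas as pd

# Allow up to 100 columns so wide dataframes print in full
pd.set_option('display.max_columns', 100)
```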

Importation of the dataset

We create a dataframe by reading a csv file and importing the data

By first observing the data, we can see that there are a lot of values in this dataset. There are 101,766 rows and 50 columns. We will certainly have to analyze and clean up this data a little to reduce it and make it more tractable.

First overview

Let's use the info() method in order to observe the structure and the type of the dataframe

Meaning of the columns

Now we want to know and understand the meaning of the different columns in order to analyze them. With this dataset, we have a pdf document explaining some characteristics of the data:

Here, we will use the pandas_profiling library to call the profile_report() function. This library extends the DataFrame for quick analysis

This report contains several pieces of information, such as the types of the columns in the dataframe and the unique and missing values; it also computes quantile and descriptive statistics and searches for correlations between the different features. We will use it as a base to clean up our data.

Profile report

Data cleaning

Counting unknown values

Here we count the percentage of missing values in the different columns of this dataframe
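In this dataset missing values are encoded as '?' rather than NaN, so one way to sketch the count, on a hypothetical toy frame, is:

```python
import pandas as pd

# Toy frame standing in for the diabetes dataframe; in this dataset
# missing values are encoded as '?' rather than NaN
df = pd.DataFrame({
    'race':   ['Caucasian', '?', 'AfricanAmerican', '?'],
    'weight': ['?', '?', '?', '[75-100)'],
    'gender': ['Female', 'Male', 'Female', 'Male'],
})

# Percentage of '?' values per column
missing_pct = (df == '?').mean() * 100
```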

Dropping columns with too many missing values

Based on the previous results, we eliminate the columns that have too many missing values

Some columns hold a constant value for every row; they carry no information, so we can eliminate them too.
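A sketch of both drops on a toy frame; the 50% threshold and the column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'weight':  ['?'] * 9 + ['[75-100)'],        # ~90% missing
    'race':    ['Caucasian'] * 8 + ['?', '?'],  # 20% missing
    'examide': ['No'] * 10,                     # constant column
})

# Drop columns whose share of '?' exceeds a threshold (50% here)
missing_pct = (df == '?').mean()
df = df.drop(columns=missing_pct[missing_pct > 0.5].index)

# Drop columns holding a single constant value
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols)
```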

Dropping redundant rows

When analysing the data in this dataset, we observed that there were some redundant rows: several rows shared the same patient number, and sometimes the associated variables were not coherent

We also drop these rows

Because of the deletion of redundant rows, the index now has gaps, so we need to reset it
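A sketch of both steps, assuming duplicates are identified by the patient number column:

```python
import pandas as pd

df = pd.DataFrame({
    'patient_nbr':      [111, 111, 222, 333],
    'time_in_hospital': [3, 5, 1, 7],
})

# Keep only the first encounter per patient, then repair the index
df = df.drop_duplicates(subset='patient_nbr', keep='first')
df = df.reset_index(drop=True)
```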

Rethink some column formats

Diag_1, Diag_2, Diag_3: mapping ICD9 codes with corresponding names

https://www.hindawi.com/journals/bmri/2014/781670/tab2/

We create a dictionary in order to map the values later; this dictionary corresponds to the values found in the link above

This mapping method replaces every value by its corresponding category in order to have tractable data. The numeric values in the columns diag_1, diag_2 and diag_3 were too scattered, so we regroup them using the information in the link above

Now we apply the mapping method to the three different columns

Now we want to create new columns to interpret the different diagnoses of the patient. To do so, we first create a method that returns a boolean value for the presence of each diagnosis

Now we map these values into new columns to interpret later with graphs and learning models
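A sketch of the whole mapping step on hypothetical data; the range boundaries follow the grouping table linked above, but the helper name and the subset of groups shown are ours:

```python
import pandas as pd

def icd9_group(code):
    """Map a raw ICD-9 code to a broad diagnosis group.
    Range boundaries follow the grouping table linked above;
    codes starting with 'E' or 'V' and unknowns go to 'Other'."""
    try:
        value = float(code)
    except (TypeError, ValueError):  # 'E…', 'V…', '?' …
        return 'Other'
    if 390 <= value <= 459 or int(value) == 785:
        return 'Circulatory'
    if 460 <= value <= 519 or int(value) == 786:
        return 'Respiratory'
    if 520 <= value <= 579 or int(value) == 787:
        return 'Digestive'
    if int(value) == 250:
        return 'Diabetes'
    if 800 <= value <= 999:
        return 'Injury'
    if 710 <= value <= 739:
        return 'Musculoskeletal'
    if 580 <= value <= 629 or int(value) == 788:
        return 'Genitourinary'
    if 140 <= value <= 239:
        return 'Neoplasms'
    return 'Other'

df = pd.DataFrame({'diag_1': ['428', '250.83', 'V45', '197']})
df['diag_1'] = df['diag_1'].map(icd9_group)

# One boolean indicator column per diagnosis group (subset shown)
for group in ['Circulatory', 'Diabetes', 'Neoplasms']:
    df['Diag_' + group] = (df['diag_1'] == group)
```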

Rethink some other columns

We want to see the actual names of the columns of this dataset

We can map another column to get a better idea of the different admission types in the later graphs. We use here the mapping values given in a document provided with the dataset


Data Visualization

Correlation Matrix

A correlation matrix is used to assess the dependency between several variables at the same time. The result is a table containing the correlation coefficients between each pair of variables.

The closer the absolute value of a coefficient (between -1 and 1) is to 1, the more correlated the two associated variables are.
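With pandas, the matrix is a single method call; the toy columns below are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'time_in_hospital': [1, 3, 5, 7, 9],
    'num_medications':  [5, 9, 14, 18, 23],
    'num_procedures':   [2, 0, 3, 1, 2],
})

# Pairwise Pearson correlation coefficients
corr = df.corr()
```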

Now we will use another library (seaborn) in order to plot the correlation matrix with a different visual style

By interpreting this correlation matrix, we can observe that this dataset does not present many correlations between its different variables. The coefficients are relatively low, and even the highest ones remain small.

The variable pairs that seem to have the highest correlation are:

  • num_lab_procedures, num_procedures and num_medications with time_in_hospital
  • num_medications with num_lab_procedures
  • num_medications with num_procedures
  • number_diagnoses with time_in_hospital and num_lab_procedures and Diag_Diabetes
  • Diag_Neoplasms with Diag_Circulatory
  • Diag_Digestive with Diag_Circulatory

Boxplots

Now, using the plotly library, we can plot some boxplots.

We select the different columns relevant to boxplots

Let's put some colors on that notebook : )

Distributions

Using bokeh, we can plot some graphs that will be more interesting with the preprocessing above
Each feature will first be analysed individually

Again, let's put some colors on that notebook : )

We create a list of all columns that we want to visualize

Deeper analysis

In this section, we will try to analyse the links between features and see how they influence one another.

We can see here that there is a relationship between the age of the patient and the time spent in hospital. On average, older patients spend more time in hospital than younger ones.

Here, the number of distinct medications seems to depend a lot on the patient's age, with a peak for patients around 60-70 years old.
Once again, older patients seem to be more affected.

We can see that the Yes and No values follow the same pattern, with many values for patients between 50 and 90 years old, but this can simply be explained by the fact that the pattern follows the distribution of patients in this dataset.
Here, however, we can deduce that, for each age class, almost 2 patients out of 3 were given diabetic medication

Here, we can see that the green dots (the patients readmitted under 30 days) seem to be grouped at the center of the plot


Principal Component Analysis

```python
from sklearn.decomposition import PCA

pca = PCA()
x_train_pca = pca.fit_transform(x_train)
x_test_pca = pca.transform(x_test)

colonnes = list(diabetes_df_ml.columns)
colonnes.remove('readmitted')

colors = ['red' if x == 1 else 'blue' for x in y_train]
pd.DataFrame(x_train_pca).plot(kind='scatter', x=1, y=0, c=colors,
                               marker='.', s=2, figsize=(20, 20))
```

Here, we tried to run a principal component analysis to find out which features could best explain the data.
Unfortunately, we did not come close to a good solution.
We hoped that, by plotting the dataset on the 2 major components and coloring the points depending on the readmitted status, we would be able to discern a pattern, but it was not the case.


Machine Learning

The aim of this section is to train some Machine Learning models on the diabetes dataset in order to predict the readmission of a patient

We will therefore try to predict the value of the "readmitted" column to answer our question

The actual readmitted column contains 3 different values (No, >30, <30)

Later, we will rearrange this column in order to have a binary decision to predict

For all models, we will not have to define a random state because the Scikit-Learn library uses the numpy random seed
which we defined at the very beginning of this notebook.

Case 1 : Predict patient's readmission under 30 days

Preparation of the data to train and test the future models

In order to be able to use Machine Learning algorithms, we must have numerical values only.

Let's see which columns are not numerical

We can see that we have categorical (interpreted as object type here) and boolean features that need to be transformed.

Let's begin with making a copy of our dataset

Building of the Machine Learning oriented dataset

To fit machine learning algorithms, we need to drop the patient number: this column cannot be used to train a model. Then, we create a copy of the actual dataframe that will be formatted for Machine Learning algorithms

Dealing with the column to predict - (Predict the patient readmission under 30 days)

As mentioned above, the actual readmission column contains 3 different values. So the idea is to map those values to 0 or 1.

We will first consider the question "Will the patient be readmitted under 30 days?"

This involves the following change: False for the 'NO' and '>30' values and True for the '<30' value.
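A sketch of that mapping with pandas (the sample values are illustrative):

```python
import pandas as pd

readmitted = pd.Series(['NO', '<30', '>30', 'NO', '<30'])

# '<30' becomes True, everything else False
target = readmitted.map({'NO': False, '>30': False, '<30': True}).astype(bool)

# Share of positives, useful later to gauge class balance
positive_share = target.mean()
```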

Dealing with categorical features

Now, the goal is to transform categorical features into numeric features. We simply assign a number to each category of the feature. To do so, we use the LabelEncoder from the Sklearn library.
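A minimal sketch of the LabelEncoder behaviour on a hypothetical gender column:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# Categories are sorted alphabetically, then numbered from 0
codes = encoder.fit_transform(['Female', 'Male', 'Female', 'Unknown'])
# encoder.classes_ keeps the mapping back to the original labels
```

Each dataframe column would get its own encoder, applied the same way.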

We now want to observe the dataset to see if the transformation was successful

We carefully verify each feature

We can see that all the columns have numerical variables corresponding to their former category, so the method was successful

Shuffling rows

We shuffle the entire dataset in order to avoid a possible ordering bias
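A common pandas idiom for this (the random_state value is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'value': range(6)})

# sample(frac=1) returns all rows in a random order;
# resetting the index removes the shuffled labels
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
```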

Before starting to use the machine learning models, we can review the correlation matrix with the current changes to the columns and their numerical values

We will retain here the variables that show the most correlation with other variables. They are the following:

  • age
  • admission_type_id
  • discharge_disposition_id
  • admission_source_id
  • time_in_hospital
  • num_lab_procedures
  • num_procedures
  • num_medications
  • number_emergency
  • number_inpatient
  • diag_1
  • diag_2
  • diag_3
  • number_diagnoses
  • A1Cresult
  • metformin
  • glimepiride
  • glipizide
  • glyburide
  • pioglitazone
  • rosiglitazone
  • insulin
  • change
  • diabetesMed
  • Diag_Circulatory
  • Diag_Respiratory
  • Diag_Digestive
  • Diag_Diabetes
  • Diag_Injury
  • Diag_Musculoskeletal
  • Diag_Genitourinary
  • Diag_Neoplasms

    There are more columns than the first time

Splitting the dataset

We import train_test_split from the sklearn library to separate our data into 2 groups in order to train our model and later verify the predictions

Scaling the dataset

We scale the data to have the same basis for all the columns
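A sketch of the split-then-scale sequence on toy arrays; note that the scaler is fitted on the training split only, so no information from the test set leaks into the scaling:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
```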

Making predictions with Machine Learning models

The idea is to try several machine learning algorithms.

Then, we will compare them in order to see which one provides the best predictions

Let's create a dataframe to summarize the following results.

We will fill that dataframe with the score and accuracy of each model

Libraries

Model 1: K-Nearest Neighbors (KNN)

We again use the sklearn library to import knn

Creating model
Computing predictions

We predict the readmitted feature using knn model applied on our test data

Evaluating the performance of the model

Now we can build a confusion matrix and use the accuracy and score to have visibility on the performance of our model

We use the seaborn library to plot a beautiful confusion matrix

We can easily create a summary of our prediction with the classification report
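A sketch of those three evaluation tools, on hypothetical labels and predictions:

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

# Hypothetical true labels and model predictions
y_test = [0, 0, 0, 1, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0]

cm = confusion_matrix(y_test, y_pred)   # rows: true class, columns: predicted
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
```

The confusion matrix `cm` is what seaborn's heatmap would display; `report` summarizes precision, recall and f1-score per class.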

Adding results to the dataframe

Model 2: Logistic regression

Creating model
Computing predictions
Evaluating the performance of the model
Adding results to the dataframe

Model 3: Linear SVC

Creating model
Computing predictions
Evaluating the performance of the model
Adding results to the dataframe

Model 4: Random Forest

Creating model
Computing predictions
Evaluating the model's performance
Adding results to the dataframe

Model 5: Adaptive boosting

Creating model
Computing predictions
Evaluating the model's performance
Adding results to the dataframe

Model 6: Decision Tree

Creating model
Computing predictions
Evaluating the model's performance
Adding results to the dataframe

Model 7: ExtraTrees

Computing predictions
Evaluating the model's performance
Adding results to the dataframe

Model 8: Naive Bayes Classifier

Creating model
Calculating predictions
Evaluating the performance of our model
Adding results to the dataframe

Tuning models

Let's try to find the best hyperparameters using GridSearch
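A minimal GridSearchCV sketch with AdaBoost on synthetic data; the parameter grid here is much smaller than a real search would be:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=42)

# Candidate hyperparameter values (illustrative)
param_grid = {'n_estimators': [25, 50], 'learning_rate': [0.5, 1.0]}

# Exhaustively evaluate every combination with 3-fold cross-validation
search = GridSearchCV(AdaBoostClassifier(random_state=42), param_grid, cv=3)
search.fit(X, y)
# search.best_params_ holds the winning combination,
# search.best_score_ its mean cross-validated accuracy
```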

Grid Search with Adaptive Boosting

The best parameters to use with the AdaBoostClassifier are n_estimators = 300 and learning_rate = 0.98. The score of the model is slightly improved, but the gain is not enormous.

We can update the dataframe with the new values

Grid Search with Random Forest

We can update the dataframe with the new values

Grid Search with Naive Bayes

We can update the dataframe with the new values

Grid Search with Extra Trees

We can update the dataframe with the new values

Comparing models

Let's compare all different models we tried.

Performance of each model

Plot

Barplot

As a result, we can conclude that the models predict the data relatively well, since the accuracy is around 90%. However, this result must be nuanced. Looking at the distribution of 0 and 1 labels in the readmitted column, we can observe this:

Only 8.8% of the labels are 1, against 91.2% of 0. The problem is therefore unbalanced.

In this case, we cannot really consider these prediction models reliable.

We will repeat the same procedure by predicting whether or not a patient can be readmitted, either before or after 30 days.

Case 2 : Predict patient's readmission under and above 30 days

Dealing again with the column to predict

We will now consider the question "Will a patient be readmitted below OR above 30 days ?"

This implies that the 'NO' value will become False and that both the '<30' AND '>30' values will become True

Let's see the new distribution of the data in this column

This is better. Let's repeat the previous steps on this new dataset before using the Machine Learning models

Model 1: K-Nearest Neighbors (KNN)

Model 2: Logistic regression

Model 3: Linear SVC

Model 4: Random Forest

Model 5: Adaptive boosting

Model 6: Decision Tree

Model 7: ExtraTrees

Model 8: Naive Bayes Classifier

The best score is obtained with the adaptive boosting model. Let's try other hyperparameters. Maybe we can improve the performance of the model.

Tuning models

Again, let's try to find the best hyperparameters using GridSearch

Grid Search with Adaptive Boosting

The best parameters to use with the AdaBoostClassifier are n_estimators = 300 and learning_rate = 0.98. The score of the model is slightly improved, but the gain is not enormous.

We can update the dataframe with the new values

Grid Search with Random Forest

We can update the dataframe with the new values

Grid Search with Naive Bayes

We can update the dataframe with the new values

Grid Search with Extra Trees

We can update the dataframe with the new values

Performance of each model

Plot

Barplot


Conclusion

Remarks about cases

Let's summarize what we have done so far.
We made two different hypotheses concerning our machine learning predictions:

Each hypothesis led to different results due to the different proportions of values.
Now that we have run some models on both, we can compare them.

Let's concatenate both dataframes to have a holistic view of our results.

Now that we have a dataframe containing all the results from our models, we can save it to a csv file.

We can see that the different models perform better in case 1, but, as mentioned earlier, this is not surprising because the two categories to predict are too unbalanced.

In case 2, we lost some accuracy, but it is a much more realistic model.